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We present a detailed derivation of some estimators of Shannon entropy for discrete distributions. 
They hold for finite samples of N points distributed into M "boxes", with N and M — > oo, but 
N/M < oo. In the high sampling regime (3> 1 points in each box) they have exponentially small 
biases. In the low sampling regime the errors increase but are still much smaller than for most other 
estimators. One advantage is that our main estimators are given analytically, with explicitly known 
analytical formulas for the biases. 



It is well known that estimating (Shannon) entropies 
from finite samples is not trivial. If one naively re- 
places the probability pi to be in "box" i by the ob- 
served frequency, pi w m/N, statistical fluctuations 
tend to make the distribution look less uniform, which 
leads to an underestimation of the entropy There have 
been numerous proposals on how to estimate the bias 
EE 0, 1 ! !, S m & H M HO, E3, El • Some make quite 
strong assumptions 0, 0] , others use Bayesian methods 
[1, [U OjJ . As pointed out in 0, [l]| , one can devise es- 
timators with arbitrarily small bias (for sufficiently large 
iV and fixed pt), but these will then have very large sta- 
tistical errors (if sufficiently many of the n.; are small 
but 7^ 0). In the present paper we want to revisit a 
method used in 0] . There a very simple correction term 
was derived which seems to be a very good compromise 
between bias, statistical errors, and ease of use. Unfortu- 
nately, the treatment in [J] was not quite systematic, and 
in particular the corrections going beyond the proposed 
term were wrong. It is the purpose of the present letter 
to provide a more systematic presentation of the method 
used in 0], to correct some of the errors made there, and 
to propose an estimator which is again very easy to use 
and which should be better than that proposed in 0] . 

We consider M ^> 1 "boxes" (states, possible exper- 
imental outcomes, ...) and N 3> 1 points or particles 
distributed randomly and independently into the boxes. 
We assume that each box has weight pi (i — 1, . . . M) 
with ^jPi = 1. Thus each box i will contain a random 
number rii of points, with E[rii] = PiN. Their distribu- 
tion is binomial, 



P( ni ; Ph N)=[ N )p?(l- Pi ) 
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Since entropy H is a sum over terms each of which de- 
pends only on one index i, we only need these marginal 
distributions instead of the more complicated and non- 
factorizing joint distribution. Some of the pi can be zero, 
but in the following we shall assume that none of them 
is large, i.e. pi <C 1 for all i. In that limit the numbers 
m are Poisson distributed, 



P(rii\ & 



(2) 



with Zi = E[rii] = p%N . The error in going from Eq.JI]) 
to ^ is 0(1/ N). Thus all derivations given below hold 
strictly only in the limit N — > oo, M — > oo, rii/N — > 
Vi, but the general case is not much more difficult, see 
footnote (HI- 

Our aim is to estimate the entropy, 



M M 

H = - )^Pi In ffj = In N - — Zj In z 
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from an observation of the numbers {rii} (in the follow- 
ing, all entropies are measured in "natural units" , not in 
bits). The estimator H(n\, . . . um ) will of course have 
both statistical errors and a bias, i.e. if we repeat this 
experiment, the average of H will in general not be equal 
to H, 



AH = E[H] -H^O. 



(4) 



In the limit N — * oo,M — * oo, the statistical error 
will go to zero (because essentially one averages over 
many boxes), but the bias will remain finite unless also 
rii — > oo Vi in this limit, which we will not assume in the 
following. Indeed it is well known that the naive estima- 
tor, obtained by assuming Zi — rii without fluctuations, 
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is negatively biased, AH na i ve < 0. 

In the limit of large N and M each contribution Zi In zi 
to the entropy will be statistically independent, and can 
thus also be estimated independently by some estimator 
which is only a function of rii 0]) 



Zi In Zi w Zi In Zi = n^n;) 
such that its expectation value is 



E[^lnzi] = rii4>(rii)P(ni;Zi). 



(6) 



(7) 



n i = l 



Notice that the sum here runs only over strictly positive 
values of m . Effectively this means that we have assumed 
that observing an outcome rii = does not give any 
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information: If rij = 0, we do not know whether this is 
because of statistical fluctuations or because Pi = for 
that particular i. 

The resulting entropy estimator is then [3| [HJ 



M- 



H 4> = \nN - — n^(n) 



(8) 



with the overbar indicating an average over all boxes, 

M 
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n4>(n) = —^n^n,). 



(9) 



Its bias is 



M 



AH <t> = —(zlnz-E[nc} > (n)]). 



(10) 



It will turn out that some of the derivations given be- 
low simplify if we consider instead of the Shannon case 
the more general Renyi entropies, 



We write T(n+l)/T(n+l+q) = B(n + l,q)/T(q) and use 
the integral representation for the beta function (Ref. [HI , 
paragraph 6.2.1) 

B(n + l,q)= f dt {l-t) n t q ~\ (16) 
Jo 

Since both this integral and the sum over n in the defini- 
tion of A{— q, z) are absolutely convergent, we can inter- 
change them. The sum can then be done exactly, giving 



A(-q,z) 
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r(g) Jo 



r(i + ? ) 
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The last term arises since the sum over n extends only 
from 1 to oo. Writing now J Q Z = J Q °° — we can express 
the first term as a Gamma function and the second as an 
incomplete Gamma function ((l5j. paragraph 6.5.3), 



m = A ln Erf 

1 i=l 

1 M 
= —q [ln ^ Z *- qlnN] - 

i— 1 

The Shannon case is recovered by taking the limit q — ► 1, 
iJ = Um 9 _>.i if (g). Eqs.dH]) to (fTUl) are then replaced by 

z? = m4>(ni,q) with 0(n) = d0(n, g)/dg|g=i, E[z 2 9 ] = 
Y^ n . n i (f>(n i ,q)P(n i ; Zj), and 

Aexp((l - q)H(q)) 4> = ^(i*- EH(«, <?)]). (12) 

For integer g > 2, the bias- free estimator is given by 
(in the following we shall suppress the index i) 



z« 



(n-q)V 

since the factorial moments satisfy 

oo i 



(13) 



(14) 



This suggests that it might be a good strategy to look 
first at the generalization of the l.h.s. for arbitrary g, and 
then analyze more closely the difference with z q . In addi- 
tion, we will see that we should start with negative real g, 
and go to positive q only later by analytic continuation. 
We thus define 



A(-q,z) = 

= E 



r(n + i) 



r(n + 1 + g) n! 
F(n+1) 
r(n + l + g) 



(15) 



r(«) 



r(i + 5 ) 
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Here we can finally continue analytically to positive q. 
Furthermore we use the recursion relation (Ref. [l5[ , para- 
graph 6.5.22) 



I e~ z z a 

T(a,z) = -T(l + a,z) 

a a 



(19) 



to arrive finally at 

cr r(n + i) 



T(n + 1 -q) 



r(i-?) 



T(l-q,z). (20) 



For the Shannon case we take the derivative with respect 
to q at q = 1 and obtain [l4| 

E[ml>(ri)] = z\nz + zEi_{z) . (21) 

Here, ip{ x ) = dh\T{x)/dx is the digamma function, and 



= T(0,a;) 



(22) 



is an exponential integral (Ref. [la] . paragraph 5.1.4). 

Eq. (l2~lj) is our first important result. For large values 
of z, zEi(z) w e~ z . Thus, if z = E[n] is large, it is 
an exponentially good approximation to simply neglect 
the last term in Eg . (|2"T|) . We call the resulting entropy 
estimator H,i, [3] [Ti]. 



1 M 

= In AT - — }^niip(ni). 



(23) 



Moreover, for z — > we have also zE\{z) — > 0, and in 
between and oo the function is positive with a single 
maximum at z = 0.434... where zE\(z) = 0.2815.... If 
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we simply neglect the last term, we make thus a negative 
bias, but at most by 

< -AH, P = zEi(z)M/N < 0.2815 . . . x M/N. (24) 

If we approximate further ip( x ) ~ Ins, we obtain the 
naive estimator. The better approximation ip{x) ~ lux — 
l/2x gives Miller's correction It can be shown that 

E[nlnra] > E[nlnn- 1/2] > E[ntp(n)] > z\nz (25) 

for all positive z. Thus both the naive estimate and 
Miller's correction are worse than H^. The difference 
is especially big for large z, where the error of the 
naive estimate goes to M/2N, the error after applying 
Miller's correction is ~ M/zN, while the error of is 
~ exp(-z)M/N. 

But we can do even better. First we notice that 



^ n + 1 n! 

n— 1 



(26) 



which has the same leading behaviour for large z as 
zEi(z). It also goes to zero for z — > 0, is positive for 
all z G [0, oo), and is smaller than zEi(z) for all z. Thus, 
replacing tp(n) by 



(-1)" 
n(n + 1) 



(27) 



gives an improved estimator. Apart from a misprint, this 
is the estimator recommended in Q, Eq.(13). 

This equation had been derived in [J] somewhat un- 
systematically, using asymptotic series expansions in an 
uncontrolled way. Because of that, the discussion of the 
more general approximation, Eq.(ll) in that paper, is 
wrong. In particular, Eq.(ll) holds (for q — > 1) not for 
all integer R, but only for odd values of R. Furthermore, 
the fact that the terms neglected in Eq.(ll) decrease as 
z~ R e~ z for large z does not mean that Eq.(ll) is exact 
in the limit R — ► oo. Finally, in contrast to what is said 
there, this limit can be taken without a risk of statistical 
errors blowing up, at least for q — > 1. 

Instead of following the derivation of we consider 
the semi-infinite sequence of real numbers G\ , Gi , . . . de- 
fined by 

Gi = -7 -In 2, 
G 2 = 2-7-ln2, 

G2n+1 = Gin, (28) 

(here, 7 = 0.577215 ... is Euler's constant) and 
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Gm+2 — G; 



2n 



2n + 1 



(n > 1). 



(29) 



ThusG 2 „ = -7-ln2 + 2/l + 2/3 + 2/5 + ...+2/(2n-l). 
Using the representation i/)(n) — —7 + 1/1 + 1/2 + 1/3 + 
. . . + l/(n — 1), one checks that 



G n =iP(n) + (-iy 



1 x"- 1 
x + 1 



dx. (30) 



On the one hand, using formula 0.244 of [16j, one can 
write this integral as an infinite sum, 



G n =V(n) + (-l) n £ 



1 



(rc + 2/)(n + 2Z + l) 



, (31) 



which can be compared to Eq.(ll) of [ij with q—*l and 
odd R — > 00. On the other hand, we obtain 



E[n(G n - V(n))] = V n(G n - ip(n))-re 

t — ^ 1 



n=l 



1 dx _ r ^ i-xz) 11 - 1 

e z > ^7 ^— — 

x+1 ^ (n-l)\ 



= —e z 



1 dx 



/o a; + l 
= -^(2z)). 

Therefore, combining this with Eq. (|21[) , we have 

E[nG„] = z\nz + zE x {2z). 



(32) 



(33) 



This is our main result. Since the last term decreases 
as e~ 2z , the error made when neglecting it decreases ex- 
ponentially faster with z = E[n] than when neglecting 
the last term in Eq. (|2"Tj) . for large z. Thus, if all boxes 
have E[rij] > 5, say, the error committed is < e -10 which 
should be negligible in all practical cases. More gener- 
ally, the error made by neglecting the last term is again 
always negative, and it is bounded by 



< -AH G < 0.1407. 



M/N, 



where [14 [ 



1 M 

H G =\nN--Y j n l G ni 



(34) 



(35) 



is our proposed best estimator. 

Let us denote by z* = 0.217 . . . the position of the 
maximum of zE\(2z). For z < z* this function is convex. 
Thus, if N/M < z*, the distribution of z- values over the 
boxes which gives the maximal bias is a delta function, 
P(z) = S(z - N/M), and Eq.([5i| can be improved to 
-AH G < Ei(2N/M). For N/M -> this diverges ~ 
ln(M/iV)- 

We might add that truncating the sum in Eq. (|3"Tj) at 
any finite I also gives valid estimators whose errors are 
between those of He and , but there seems no reason 
to prefer any of them over Hq or H^. Taking only the 
term with / = gives Eq.([2"T|). 

The error terms E [n0(n)] — z In z for 4>{n) = In n, Inn — 
1/2, ip(n), ip(n) + (—l) n /n/(n + 1), and G n are shown 
in Fig.l, together with one more curve discussed below. 
The functions 4>(n) themselves are shown in Fig. 2. 

We can give estimators with even smaller absolute bias, 
i.e. with \ AH\ < 0.1407 . . ., but they have several draw- 
backs: 
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-0.1 



<|>(n) = In n 
<]>(n) = In n - 1/2n 
*(n) = V(n) 
(|>(n)=\|/(n) + (-1) n /n/(n+1) 
<Mn) = G n 
cf>(n) annealed 



3 

z = <n> 



FIG. 1: Error terms for fixed z = E[n] and for different func- 
tions <f)(n). While the first five are analytic, the last one is 
just one typical simulated annealing result. Different cost 
functions, annealing schemes, and random number sequences 
will give slightly different results. 
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4>(n) = In n 
(|>(n) = In n - 1/2n 

<Mn) = V(n) 
n) = \|/(n) + (-1) n /n/(n+1) 
«t>(n) = G n 
df(n) annealed 
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FIG. 2: Functions <f>(n) corresponding to the error terms 
shown in Fig.l. Notice that they are defined only for inte- 
ger n. Values at non-integer n are just plotted to guide the 
eye. 



• Their biases can have either sign. 

• We were only able to find them numerically, by 
minimizing (by simulated annealing) a cost func- 
tion like e.g. the L 2 norm 



dz 



^2n<p(n)'- 



zhiz\ 



(36) 



71=1 



Typical results obtained in this way are shown in 
Figs.l and 2 [13 . 



The resulting function 4>(n) replacing ip{n) resp. 
G n is not monotonic, and its total variation as mea- 
sured e.g. by n [4>( n ) — G n } 2 would diverge 
as 5 — > (indeed, the results shown in Figs.l and 2 
were obtained by adding 0.0002 times this term as 
a regularizer to the L 2 norm) . This is the most se- 
rious drawback. It means that large cancellations 
must occur and thus statistical errors blow up in 
the limit 8 — > (if TV is kept finite), as is to be 
expected on general grounds |4j. There cannot be 
any estimator of H completely free of bias for finite 
N. Notice that G n is the "best" sequence which is 
still monotonic. Estimates based on non-monotonic 
4>{ri) might be useful if one has important contribu- 
tions from extremely small z, i.e. if either JV<M 
or if the distribution of pi is so uneven that many 
boxes have small (but not too small) z%. 



I have applied the above estimators to the six exam- 
ples shown in Fig. 4 of [12] • In each of these examples the 
number of boxes was M > 1000, although the number of 
non-empty boxes was smaller in some of them. Never- 
theless, the distributions were severely undersampled in 
most cases when N < 300. In all cases the annealed <j>{n) 
shown in Fig. 2 gave statistical errors smaller or compa- 
rable to the Bayesian estimators of [l2j , and the bias was 
smaller than the statistical errors for all N > 300. In all 
but two cases (Zipf's law and (3=1, with /3 defined in 
[l2l ]) the bias was negligible even down to N = 10. With 
Eq. (f35|) , the bias was significant (> 2a) in the same two 
cases for all N < 300, and in the case f3 = 0.02 for 
N = 10. 

In summary, I hope to have clarified the arguments 
and corrected the mistakes made in and I have sub- 
stantially improved on the results. I have proposed a new 
analytic estimator for Shannon entropy which has very 
small systematic errors, except when the average num- 
ber of points per box is much smaller than 1. Its sta- 
tistical errors should be larger than those of the naive 
estimator (since there contributions from rij = 1 and 
from rii > 1 partially cancel), but this difference should 
be small. In addition, it is shown that numerically ob- 
tained estimators can be useful for extremely undersam- 
pled cases. The estimator and the first correction 
based on Eq. (|2"Tj) can be generalized straightforwardly to 
Renyi entropies, but I was not able to generalize the new 
estimator, Eq. (j3l)|) , to q ^ 1. The present estimators can 
not match the best Bayesian estimators [12| when the 
sampling is extremely low, but they are much simpler to 
use and more robust, as no guess of any prior distribution 
is needed. 

I want to thank Walter Nadler for carefully reading the 
manuscript, and to Liam Paninski for correspondence. 
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